Introduction


Our  sensory  systems  are  capable  of  representing  a  vast  number  of  possible  stimuli.  Our 
environment  presents  us  with  only  a  small  fraction  of  the  possibilities;  this  selected  subset  is 
characterised  by  many  regularities.  Our  minds  encode  these  regularities,  and  this  gives  us  some 
ability  to  infer  the  probable  current  condition  of  unknown  portions  of  the  environment  given 
some  limited  information  about  the  current  state.  What  kind  of  regularities  exist  in  the 
environment  and  how  should  they  be  encoded? 

This paper presents preliminary results of research founded on the hypothesis that in real
environments there exist regularities that can be idealised as mathematical structures that are
simple enough to be analysable. Only the simplest kind of regularity is considered here: I will
assume that the environment contains modules (objects) that recur exactly, with various states
of the environment being comprised of various combinations of these modules. Even this simplest
kind of environmental regularity offers interesting learning problems and results. It also
serves to introduce a general framework capable of treating more subtle types of regularities.
And the problem considered is an important one, for the delineation of modules at one level of
conceptual representation is a major step in the construction of higher level representations.

To  analyze  the  encoding  of  modularity  of  the  environment,  I  will  proceed  in  three  steps. 
First,  I  will  describe  a  general  information  processing  task,  completion,  for  a  cognitive  system. 
Then  I  will  describe  the  entities,  schemas,  I  use  for  encoding  the  environmental  modules  and 
discuss how they are used by the cognitive system to perform its task. Finally I will offer a
criterion for how the encoding of the environment into schemas should be done. The presentation
will be informal; more precise statements of definitions and results appear in the Appendix.

Completions and Schemas

The  task  I  consider  here,  completion,  is  the  inferring  of  missing  information  about  the 
current  state  of  the  environment.  The  cognitive  system  is  given  a  partial  description  of  that 
state  as  input,  and  must  produce  as  output  a  completion  of  that  description. 

The entities used to represent the environmental modules are called schemas. A given
schema represents the hypothesis that a given module is present in the current state of the
environment. When the system is given a partial description of the current state of the
environment, the schemas look at the information to see if they fit it; those that do become
active. When inference from the given information to the missing information is possible, it is
because some of the schemas represent modules that incorporate both given and unknown
information. Such a schema being active, i.e., the belief that the module is present in the
current environmental state, permits inferences about the missing information pertaining to the
module. (Thus if the given information about a word is ALG###THM, the schema for the
module algorithm is activated, and inferences about the missing letters are possible.)
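
As a toy illustration of the completion task, the sketch below treats whole-word schemas as plain strings and uses `#` for an unknown letter. The function names and the three-word lexicon are hypothetical, and the matching here is deterministic; it is not the stochastic procedure developed in the following sections.

```python
def fits(schema: str, partial: str, unknown: str = "#") -> bool:
    """A schema fits if it agrees with every known character of the input."""
    return len(schema) == len(partial) and all(
        p == unknown or p == s for s, p in zip(schema, partial)
    )


def complete(partial: str, schemas: list) -> list:
    """Return every schema that is consistent with the partial description."""
    return [s for s in schemas if fits(s, partial)]


# Hypothetical schema set, chosen only for illustration.
lexicon = ["ALGORITHM", "LOGARITHM", "ALGEBRAIC"]
print(complete("ALG###THM", lexicon))
```

Only the schema that agrees with all six given letters survives, and it supplies the three missing ones.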

There is a problem here with the sequence of decision-making. How can the schemas
decide if they fit the current situation when that situation is only partially described? It would
appear that to really assess the relevance of a schema, the system would first have to fill in the
missing information. But to decide on the missing portions, the system first needs to formulate
beliefs concerning which modules are present in the current state. I call this the schema/inference
decision problem; what is desired is a way of circumventing this problem that is general and
extremely simple to describe mathematically.

Harmony and Computational Temperature

Before describing the algorithm used to tackle this problem, let us first try to characterize
what outcome we would like the algorithm to produce. The approach I am pursuing assumes
that the best inference about the missing information is the one that best satisfies the most hypotheses
(schemas). That is, consider all possible responses of the system to an input (a partially
specified environmental state). Each such response involves (a) a decision about exactly which
schemas are active, and (b) a decision about how to specify all the missing information. Given
such a response, I propose a measure of the internal consistency that computes the degree to
which the hypotheses represented by the active schemas are satisfied by the input and output. I
call this measure the harmony function H. Responses characterized by greater internal
consistency (greater harmony) are 'better.' (The definition of H is simple, but requires the
formalism given in the Appendix.)

Given that better responses are characterized by greater harmony, the obvious thing to do
is hill-climb in harmony. However, this method is apt to get stuck at local maxima that are not
global maxima. We therefore temporarily relax our desire to go directly for the 'best' response;
instead we consider a stochastic system that gives different responses with different probabilities:
the better the response, the more likely the system is to give it. The degree of spread in the
probability distribution is denoted T; for high values of T, the distribution is widely spread,
with the better responses being only slightly more likely than less good ones; for low values of
T, the best response is much more likely than the others. Thus T measures the 'randomness'
in the system; I call it the computational temperature. When we want only good responses, we
must achieve low temperature.

A  general  stochastic  algorithm  can  be  derived  that  realizes  this  probabilistic  response;  it 
provides  a  method  for  computer  simulation.  A  parallel  relaxation  method  is  used  to  resolve  the 
schema/inference decision problem in a way that involves no sophisticated control.

The variables of the system are the activations of all the schemas (1 = active, 0 = inactive)
and the bits of missing information. The system starts with (a) all schemas inactive, and
(b) completely random guesses for all bits of missing information. Then, a schema or a bit of
missing information is selected randomly as the variable to be inspected; it will now be assigned
a (possibly new) value of 1 or 0. Using the current guesses for all the other variables, the
harmony function is evaluated for the two possible states of the selected variable. These two
numbers, H(1) and H(0), measure the overall consistency for the two cases where the selected
variable is assigned 1 and 0; they can be computed because tentative guesses have been made
for all the schema activations and missing information. Next a random choice of these 1,0
states is made, with the probability of the choices 1 and 0 having ratio $e^{H(1)/T} / e^{H(0)/T}$.

(The  reason  for  the  exponential  function  is  given  in  the  Appendix.) 
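
Since the two probabilities must sum to one, a ratio of exp(H(1)/T) to exp(H(0)/T) is equivalent to choosing 1 with probability 1/(1 + exp(-(H(1) - H(0))/T)). A minimal sketch of that single decision follows; the function names are illustrative and not taken from the paper.

```python
import math
import random


def prob_one(h1: float, h0: float, temperature: float) -> float:
    """P(variable = 1) when P(1) / P(0) = exp(h1 / T) / exp(h0 / T)."""
    x = (h1 - h0) / temperature
    # Logistic form, split by sign so that exp never overflows.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sample_state(h1: float, h0: float, temperature: float, rng=random) -> int:
    """Draw the new 0/1 value of the inspected variable."""
    return 1 if rng.random() < prob_one(h1, h0, temperature) else 0


# The same harmony difference is followed more reliably at lower temperature.
for t in (10.0, 1.0, 0.1):
    print(t, prob_one(1.0, 0.0, t))
```

Only the difference H(1) - H(0) matters, so any part of the harmony that does not involve the selected variable can be left out of the evaluation.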

This random process (pick a schema or bit of missing information; evaluate the two
harmonies; pick a state) is iterated. In the theoretical limit that the process continues
indefinitely, it can be proved that the probability that the system gives any response is
proportional to $e^{H/T}$, where H is the harmony of that response. This probability distribution
satisfies the qualitative description of response probabilities we set out to realize.

Cooling the System

In this algorithm, each schema activation or bit of missing information is determined
randomly; most likely the value with higher harmony is chosen, but sometimes not. Thus most of
the time the changes raise the harmony, but not always. The higher the temperature, the more
random are the decisions, that is, the more often the changes go 'downhill.' Thus, this
algorithm, unlike strict hill-climbing, does not get stuck at local maxima. However, there is a
tradeoff. The higher the temperature, the faster the system escapes local maxima by going
downhill; but the higher the temperature, the more random is the motion, and the more time
the system spends in states of low harmony. Eventually, to be quite sure the system's response
has high harmony, the temperature must be low.

As the computation proceeds, the optimal point of this tradeoff shifts. Initially, the
guesses for the missing information are completely random, so the information on which
schemas are determining their relevance is unreliable. It is therefore desirable to have
considerable randomness in the decision making, i.e., high temperature. Even at high temperature,
however, the system is more likely to occupy states of higher harmony than lower, so the
guesses become more reliable than their completely random start. At this point it makes sense
for the schemas to be somewhat less random in their activity decisions, so the temperature
should be lowered a bit. This causes the system to spend more of its time in states of higher
harmony, justifying a further decrease in temperature. And so the temperature should be
gradually lowered to achieve the desired final condition of low temperature. (A major portion of
future work will be analysis of how to regulate the cooling of the system.)
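
The loop below is a rough sketch of stochastic relaxation with a falling temperature. The text leaves the cooling schedule open, so the geometric schedule used here (`t_start`, `t_end`, `decay`, `updates_per_var`) is purely an assumption, as are the function names and the toy harmony function.

```python
import math
import random


def prob_one(h1, h0, t):
    """P(value = 1) given harmonies h1, h0 and temperature t (logistic form)."""
    x = (h1 - h0) / t
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def anneal(harmony, state, free, t_start=2.0, t_end=0.05, decay=0.95,
           updates_per_var=20, seed=0):
    """Relax the binary variables listed in `free` while lowering the temperature.

    harmony: callable taking the full list of 0/1 values, returning a float.
    state:   list of 0/1 values, modified in place; entries not in `free` stay fixed.
    """
    rng = random.Random(seed)
    t = t_start
    while t > t_end:
        for _ in range(updates_per_var * len(free)):
            i = rng.choice(free)           # pick a variable at random
            state[i] = 1
            h1 = harmony(state)            # harmony with the variable set to 1
            state[i] = 0
            h0 = harmony(state)            # harmony with the variable set to 0
            state[i] = 1 if rng.random() < prob_one(h1, h0, t) else 0
        t *= decay                         # assumed geometric cooling step
    return state


target = [1, 0, 1, 1, 0, 1]


def toy_harmony(state):
    # +1 for each bit agreeing with the target pattern, -1 for each disagreement.
    return sum(1 if s == t else -1 for s, t in zip(state, target))


print(anneal(toy_harmony, [0] * 6, free=list(range(6))))
```

The toy harmony has a single maximum, so it shows only the mechanics of the loop; the case for annealing rests on harmony functions with local maxima, where the early high-temperature phase lets the state move downhill.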

As  the  computation  proceeds  and  the  temperature  drops,  the  system’s  initially  rough  and 
scattered  response  becomes  progressively  more  accurate  and  consistent.  This  is  just  the  kind  of 
computation  typical  in  people  and  just  the  kind  needed  in  any  large  parallel  system,  where  each 
sub-system  needs  a  constant  stream  of  input  from  the  others. 

Cognitive  Crystallization 

As  computation  proceeds,  does  accuracy  increase  slowly  and  steadily,  or  does  the  system 
undergo  sudden  and  dramatic  changes  in  behavior,  as  do  physical  systems  when  they  are  cooled 
past  critical  temperatures  marking  phase  transitions?  This  question  has  been  addressed  both 
through  computer  simulation  and  analytic  approximation  of  a  two-choice  decision.  The  system 
has  two  schemas  representing  conflicting  interpretations  of  the  environment.  The  approximate 
theory  allows  computation  of  the  probabilities  of  various  completions  given  an  input  that  partly 
describes the environment. It is useful to ask, what is the completion of a completely ambiguous
input? For high temperatures, as one might expect, the completions form random mixtures of
the two interpretations; the system does not choose either interpretation. However, below a
certain 'freezing temperature' the system adopts one of the interpretations. Each interpretation is
equally  likely  to  be  selected.  The  computer  simulation  approximates  this  behavior;  below  the 
freezing  point,  it  flips  back  and  forth  between  the  two  interpretations,  occupying  each  for  a  long 
time. 

Slowly vacillating interpretation of genuinely ambiguous input is a familiar but not particularly
important feature of human cognition. What is significant here is that even when the input
provides no help whatsoever in selecting an interpretation, the system eventually (when cooled
sufficiently) abandons meaningless mixtures of interpretations and adopts some coherent interpretation.
A robust tendency to form coherent interpretations is important both for modelling human
cognition and for building intelligent machines. The above analysis suggests that in processing
typical inputs, which are at most partially ambiguous, as processing continues and the temperature
drops, the system wanders randomly through an ever-narrowing range of approximate solutions
until some time when the system freezes into an answer.

Schema  Selection 

Having discussed how schemas are used to do completions, it is time to consider what set
of schemas ought to be used to represent the regularities in a given environment. Suppose the
system experiences a set of states of the environment and from these it must choose its
schemas. Call these states the training set. Since the schemas are used to try to construct
high-harmony responses, a reasonable criterion would seem to be: the best schemas are those that
permit the greatest total harmony for responses to the training set.

I  call  this  the  training  harmony  criterion.  I  will  not  discuss  here  an  algorithm  for  finding 
the  best  schemas.  Instead,  I  shall  present  some  elementary  but  non-obvious  implications  of 
this very simple criterion. To explore these implications, we create various idealized
environments each displaying some interesting modularity. Choosing a training set from within an
environment,  we  see  whether  the  training  harmony  criterion  allows  the  system  to  induce  the 
modularity  from  the  training  set,  by  choosing  schemas  that  encode  the  modularity. 

Perceptual  Grouping 

We perceive scenes not as wholes, nor as vast collections of visual features, but as
collections of objects. Is there some general characterization of what is natural about the particular
levels of grouping that form these 'objects'? This question can be addressed at a simple but
abstract level by considering an environment of strings of four letters in some fixed font. In
this idealized environment, the modules ('objects') are letter tokens, and in various states of
the environment they recur exactly: each letter always appears in exactly the same form. The
location of each of the four letters is absolutely fixed; I call the location of the first letter the
'first slot,' and so forth. The environment consists of all combinations of four letters.

Now consider a computer given a subset of these four-letter strings as training; call these
the training words. The image the computer gets of each training word is just a string of bits,
each bit representing whether some portion of the image is on or off. The machine does not
know that 'bit 42' and 'bit 67' represent adjacent places; all bits are spatially meaningless.
Could the machine possibly induce from the training that certain bits 'go together' in
determining one of the letters, for example, those bits that we know represent the first slot? Is there
some sense in which schemas for the letter A in the first slot, and so on, are natural encodings
of this environment?

As an obvious alternative, for example, the system could simply create one schema for
each training word. Or it could create a schema for each bit in the image. These are the two
extreme cases of maximally big and small schemas; the letter schemas fall somewhere in
between. Which of these three cases is best? The training harmony criterion implies that letter
schemas are best, provided the training set and number of bits per letter are not too small.

This result can be abstracted from the reading context in which it was presented for
expository convenience; the mathematical result does not depend upon the interpretation we place
upon  the  modules  with  which  it  deals.  Thus  the  result  can  be  characterized  more  abstractly  as 
providing  evidence  that  natural  schemas  encoding  the  modules  of  an  environment  are  inducible  by 
the  training  harmony  criterion,  provided  the  modules  recur  exactly.  This  investigation  must  now 
be  extended  to  cases  in  which  the  recurrence  of  the  modules  is  in  some  sense  approximate. 

Rules vs. Instances

When  should  experience  be  encoded  as  a  list  of  instances  and  when  as  a  collection  of 
rules?  To  address  this  issue  we  consider  two  environments  that  are  special  subsets  of  the 
four-letter  environment  considered  above: 


Environment R
FAMB  VEMB  SIMB  ZOMB
FAND  VEND  SIND  ZOND
FARP  VERP  SIRP  ZORP
FALT  VELT  SILT  ZOLT

Environment I
FARB  VAMP  SALT  ZAND
FENP  VELB  SEMD  ZERT
FIMT  VIRD  SINB  ZILP
FOLD  VONT  SORP  ZOMB


In the highly regular environment R, there are strict rules such as 'F is always followed by A';
in the irregular environment I, no such rules exist. Note that here, schemas for the 'rules' of
environment R are just digraph schemas:

FA__, VE__, ..., __RP, __LT;
schemas for 'instances' are whole-word schemas.

One might hope that a criterion for schema selection would dictate that environment R be
encoded in digraph schemas representing the rules while environment I be encoded in word
schemas representing instances. The training harmony criterion implies that for the regular
environment, digraph schemas are better than word schemas; for the irregular environment, it is the reverse.
(In each case the entire environment is taken as the training set.)
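
To make the schema counts for the regular environment concrete, the sketch below enumerates environment R, assuming its sixteen words are exactly the combinations of the first digraphs FA, VE, SI, ZO with the final digraphs MB, ND, RP, LT implied by the rules. The helper name `digraph_schemas` and the slot labels are illustrative only.

```python
from itertools import product

first_digraphs = ["FA", "VE", "SI", "ZO"]
last_digraphs = ["MB", "ND", "RP", "LT"]

# Assumed reconstruction of environment R: every first digraph with every last digraph.
env_r = [a + b for a, b in product(first_digraphs, last_digraphs)]


def digraph_schemas(words):
    """Distinct digraph schemas needed to tile each word as (slots 1-2) + (slots 3-4)."""
    first = {("slots 1-2", w[:2]) for w in words}
    last = {("slots 3-4", w[2:]) for w in words}
    return first | last


word_schemas = set(env_r)  # one whole-word schema per instance
print(len(word_schemas), len(digraph_schemas(env_r)))
```

Tiling R takes half as many digraph schemas as word schemas, which is the comparison the criterion turns on.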

Higher  Level  Analyses 

The framework described here is capable of addressing more sophisticated learning issues.
In particular, it is well-suited to analyzing the construction of higher-level representations and
considering, for example, the value of hierarchical organization (which is not put into the
system, but may come out in appropriate environments). In addition to addressing issues of
schema  selection  at  the  more  perceptual  levels  considered  here,  the  framework  can  be 
employed  at  higher  conceptual  levels.  The  selection  of  temporal  scripts,  for  example,  can  be 
considered  by  taking  the  environment  to  be  a  collection  of  temporally  extended  episodes.  The 
simulation  method  described  here  for  systems  with  given  schemas  can  also  be  applied  at  higher 
levels;  it  is  being  explored,  for  example,  for  use  in  text  comprehension. 

APPENDIX: The Formal Framework of Harmony Theory

In the following general discussion, the specifics of the four-letter grouping environment
discussed  in  the  text  are  presented  in  parentheses. 

The possible beliefs B of the cognitive system about the current state of the environment
form a space $\mathbf{B}$. This space is assumed to have a set $\mathbf{p}$ of binary coordinates p; every belief B is
defined by a set of bits $B(p) \in \{+1,-1\}$, one for each $p \in \mathbf{p}$. (Each p represents a distinct
pixel. B(p) is +1 if the system believes p is on, -1 if off. This belief can come from inference
or input.) An input I to the system is a set of binary values I(p) for some of the $p \in \mathbf{p}$; an output
O is a set of binary values O(p) for the remaining p. Together, I and O form a complete belief
state B, a completion of I.

A schema S is defined by a value $S(p) \in \{+1,-1,0\}$ for each $p \in \mathbf{p}$. (If pixel p does not lie
in the first slot, the schema S for 'the letter A in the first slot' has S(p) = 0. If p does lie in
the first slot, then S(p) is ±1 according to whether the pixel is on or off.) If for some particular
p, $S(p) \neq 0$, then that p is an argument of S; the number of arguments of S is denoted |S|.
(For a word schema, every p is an argument.)

Schemas  are  used  by  the  system  to  infer  likely  states  of  the  environment.  For  a  given 
environment,  some  schemas  should  be  absent,  others  present,  with  possibly  varying  relative 
strengths  corresponding  to  varying  likelihood  in  the  environment.  A  knowledge  base  for  the 
system is a function σ that defines the relative strengths of all possible schemas: $\sigma(S) \ge 0$ and
$\sum_S \sigma(S) = 1$. (The knowledge bases relevant to the text have all σ(S) = 0 except for the
schemas in some set $\mathbf{S}$, like letters, for which the strengths are all equal. These strengths,
then, are all $1/\#\mathbf{S}$, the inverse of the number of schemas in the set.)

A response of the system to an input I is a pair (A,B), where B is a completion of I, and
A defines the schema activations: $A(S) \in \{0,1\}$ for each schema S.

A harmony function H is a function that assigns a real number $H_\sigma(A,B)$ to a response
(A,B), given a knowledge base σ, and obeys certain properties to be discussed elsewhere. A
particular exemplar is

$$H_\sigma(A,B) = \sum_S \sigma(S)\, A(S) \Big[ \sum_p B(p)\,S(p) - \kappa\,|S| \Big]$$


Here κ is a constant in the interval [0,1]; it regulates what proportion ε of a schema's
arguments must disagree with the beliefs in order for the harmony resulting from activating that
schema to be negative: $\epsilon = \tfrac{1}{2}(1-\kappa)$. In the following we assume ε to be small and positive.
(The terms without κ make H simply the sum over all active schemas of the strength of the
schema times the number of bits of belief (pixels) that are consistent with the schema minus
the number that are inconsistent. Then each active schema incurs a cost proportional to its
number of arguments, where κ is the constant of proportionality.)
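
The verbal description above (strength times consistent-minus-inconsistent bits, less a cost of κ per argument) can be sketched directly. The data layout is an assumption made for illustration: each schema is a dict from pixel to +1 or -1, with absent pixels standing for S(p) = 0, and the two example schemas are hypothetical.

```python
def harmony(schemas, sigma, active, belief, kappa):
    """Sum over active schemas of sigma(S) * (sum_p B(p) S(p) - kappa * |S|).

    schemas: dict name -> {pixel: +1 or -1}   (absent pixel means S(p) = 0)
    sigma:   dict name -> strength
    active:  dict name -> 0 or 1
    belief:  dict pixel -> +1 or -1
    """
    total = 0.0
    for name, schema in schemas.items():
        if not active.get(name, 0):
            continue  # inactive schemas contribute zero harmony
        # Consistent bits count +1, inconsistent bits count -1.
        agreement = sum(belief[p] * value for p, value in schema.items())
        total += sigma[name] * (agreement - kappa * len(schema))
    return total


# Hypothetical two-schema knowledge base over four pixels.
schemas = {"left": {0: +1, 1: -1}, "right": {2: +1, 3: +1}}
sigma = {"left": 0.5, "right": 0.5}
belief = {0: +1, 1: -1, 2: +1, 3: -1}

print(harmony(schemas, sigma, {"left": 1, "right": 0}, belief, kappa=0.5))
print(harmony(schemas, sigma, {"left": 1, "right": 1}, belief, kappa=0.5))
```

Here "left" matches the belief exactly and adds harmony, while "right" disagrees on half of its arguments and subtracts it. In general a schema whose fraction d of arguments disagree contributes σ(S)|S|(1 - 2d - κ), which is negative once d exceeds (1 - κ)/2.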

The probability of any response (A,B) is a monotonically increasing function f of its
harmony H(A,B). f is constrained by the following observations. If $p_1$ and $p_2$ are not connected
(even indirectly) through the knowledge base σ, then inferences about $p_1$ should be
statistically independent of those about $p_2$. In that case, H is the sum of the harmony contributed
by three sets of schemas: those connected to $p_1$, those connected to $p_2$, and those connected to
neither. The desired statistical independence requires that the individual probabilities of
responses for $p_1$ and $p_2$ multiply together. Thus f must be a function that takes additive
harmonies into multiplicative probabilities. The only continuous functions that do this are the
exponential functions, a class that can be parametrized by a single parameter T; thus

$$\mathrm{prob}(A,B) = n\, e^{H(A,B)/T}.$$

The normalization constant n is chosen so that the probabilities for all possible responses add
up to one. T must be positive in order that the probability increase with H. As T approaches
zero, the function f approaches the discontinuous function that assigns equal nonzero
probability to maximal harmony responses and zero probability to all others.
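
The effect of T on the exponential distribution can be checked numerically. This sketch normalizes exp(H/T) over a small, made-up list of harmonies; the function name and the example values are illustrative only.

```python
import math


def response_probabilities(harmonies, temperature):
    """Probabilities proportional to exp(H / T), normalized to sum to one."""
    h_max = max(harmonies)
    # Subtracting the maximum leaves the ratios unchanged and avoids overflow.
    weights = [math.exp((h - h_max) / temperature) for h in harmonies]
    total = sum(weights)
    return [w / total for w in weights]


example = [1.0, 1.0, 0.0]  # two maximal-harmony responses and one worse response
for t in (10.0, 1.0, 0.01):
    print(t, [round(p, 3) for p in response_probabilities(example, t)])
```

At high temperature the three responses are nearly equiprobable; as T approaches zero the two maximal-harmony responses share the probability equally and the third is effectively excluded.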

Considerations similar to those of the previous paragraph lead to an isomorphic formula in
thermal physics (the Boltzmann law) relating the probability of a physical state to its energy.
The randomness parameter there is ordinary temperature, and therefore, I call T the
computational temperature of the cognitive system.

The probability of a completion B is simply the total probability of all possible responses
(A,B) that combine the beliefs B with schema activations A: $\mathrm{prob}(B) = n \sum_A e^{H(A,B)/T}$.

Monte  Carlo  analyses  of  systems  in  thermal  physics  have  proven  quite  successful  (Binder, 
1976). Starting from the exponential probability distribution given above, the stochastic
process described in the text can be derived (as in Smolensky, 1981). The above formula for H
leads to a simple form for the stochastic decision algorithm of the text which supports the
following interpretation. The variables can be represented as a network of 'p nodes' each carrying
value B(p), 'S nodes' each carrying value A(S), and undirected links from each p to each S
carrying label σ(S)S(p) (links labelled zero can be omitted). The nodes represent stochastic
processors running in parallel, continually transmitting their values over the links and
asynchronously setting their values, each using only the labels on the links attached to it and the values
from other nodes transmitted over those links. (The values I(p) of those p nodes determined
by the input are fixed.) This representation makes contact with the neurally-inspired
'connectionist' approach to cognition and parallel computation (Hinton and Anderson, 1981; McClelland
and Rumelhart, 1981; Rumelhart and McClelland, 1982). Independently of the development of
harmony theory, Hinton and Sejnowski (1983) developed a closely related approach to
stochastic parallel networks following (Hopfield, 1982) and (Kirkpatrick, Gelatt, and Vecchi, 1983).
From a nonconnectionist artificial intelligence perspective, Hofstadter (1983) is pursuing a
related approach to perceptual grouping; his ideas have been inspirational for my work
(Hofstadter, 1979).

An environment for the cognitive system is a probability distribution on $\mathbf{B}$. (All patterns
of on/off pixels corresponding to sequences of four letters are equally probable; all other
patterns have probability zero.) A training set $\mathbf{T}$ from this environment is a sample of points T
drawn from the distribution. (Each T is a training word.) A response A of the system to the set
$\mathbf{T}$ is a specification of schema activations A(T) for each T. The training harmony of such a
response is $H_\sigma(A,\mathbf{T}) = \sum_T H_\sigma(A(T),T)$. The maximum of $H_\sigma(A,\mathbf{T})$ over all responses A is
$H_\sigma(\mathbf{T})$, the training harmony permitted by the knowledge base σ.

Each of the sets of schemas considered in the text (letters, digraphs, etc.) tile the training
set $\mathbf{T}$, in that each T agrees exactly with schemas that have nonoverlapping arguments and that
together cover all of T. All results cited in the text follow from this elementary calculation: If
the knowledge base σ consists of a set of schemas $\mathbf{S}$ that tile $\mathbf{T}$, then $H_\sigma(\mathbf{T}) = \mathrm{const.}/\#\mathbf{S}$ where
const. is a constant for a given $\mathbf{T}$. Thus, given a single training set $\mathbf{T}$ and two tilings of $\mathbf{T}$ with
sets of schemas, the set with fewer schemas permits greater training harmony, and is preferred
by the training harmony criterion. If the number of letters in the alphabet a is smaller than the
number of pixels per letter, letters are a better encoding than pixels; if the number of training
words exceeds 4a then letters are better than words. The number of word schemas needed in
each of the restricted environments R and I is 16; the number of digraphs needed for R is 8
while for I it is 96.

The calculation cited in the previous paragraph is quite simple. Recall that if the
proportion of a schema's arguments that disagree with the beliefs exceeds ε, then the harmony
resulting from activating that schema is negative. Thus if ε is chosen small enough (which we
assume), then for any given training word T, only those schemas that match exactly can
contribute positive harmony. In the response A(T) that maximizes the harmony, therefore, only
exactly matching schemas will be active; the others will be inactive, contributing zero harmony.
Since the schemas $\mathbf{S}$ tile $\mathbf{T}$, for each pixel p in T there is exactly one active schema with p as
an argument, and the value B(p) of the pixel is consistent with that schema, so B(p)S(p) = 1.
Thus,

$$\sum_S A(S) \sum_p \big[ B(p)\,S(p) - \kappa\,|S(p)| \big] = \#\mathbf{p}\,(1-\kappa) = \#\mathbf{p} \cdot 2\epsilon.$$

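Because σ(S) is a constant for all $S \in \mathbf{S}$, the harmony $H_\sigma(A(T),T)$ is simply the previous
expression times $\sigma(S) = 1/\#\mathbf{S}$, or $2\epsilon\,\#\mathbf{p}/\#\mathbf{S}$. Since this quantity is identical for all training
words T, summing over all T in the training set $\mathbf{T}$ just multiplies this by the size of $\mathbf{T}$, giving
$2\epsilon\,\#\mathbf{p}\,\#\mathbf{T}/\#\mathbf{S}$ as the training harmony permitted by σ:

$$H_\sigma(\mathbf{T}) = \mathrm{const.}/\#\mathbf{S}$$

where $\mathrm{const.} = 2\epsilon\,\#\mathbf{p}\,\#\mathbf{T}$, which is constant for a given $\mathbf{T}$.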
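
A short numerical check of the closing formula, using the schema counts quoted above for environments R and I. The pixel count (four letters at 35 pixels each) and the value of ε are hypothetical; they scale every result equally and do not affect the comparisons.

```python
def training_harmony(n_pixels, n_training, n_schemas, epsilon):
    """2 * epsilon * #p * #T / #S for equal-strength schemas that tile the training set."""
    return 2 * epsilon * n_pixels * n_training / n_schemas


N_PIXELS = 4 * 35   # hypothetical image size: four slots of 35 pixels each
EPSILON = 0.05      # hypothetical small positive value
N_WORDS = 16        # each restricted environment is its own 16-word training set

# Schema counts as stated in the text: 16 word schemas; 8 digraphs for R, 96 for I.
r_words = training_harmony(N_PIXELS, N_WORDS, 16, EPSILON)
r_digraphs = training_harmony(N_PIXELS, N_WORDS, 8, EPSILON)
i_words = training_harmony(N_PIXELS, N_WORDS, 16, EPSILON)
i_digraphs = training_harmony(N_PIXELS, N_WORDS, 96, EPSILON)

print("R: digraphs preferred?", r_digraphs > r_words)
print("I: words preferred?", i_words > i_digraphs)
```

Because only the schema count varies between encodings of the same training set, the preference always goes to the tiling with fewer schemas.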